Practical Pandas 02: Exporatory Data Analysis

Welcome back. As a reminder, we've got a dataset with my cycling data from last year merged and stored in an HDF5 store. Today we'll use pandas, seaborn, and matplotlib to do some exploratory data analysis.



In [2]:

    
%matplotlib inline

import os
import datetime

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt



In [3]:

    
df = pd.read_hdf(os.path.join('data', 'cycle_store.h5'), key='merged')
df.head()









    Out[3]:






  
    
      
      Time
      Ride Time
      Ride Time (secs)
      Stopped Time
      Stopped Time (secs)
      Latitude
      Longitude
      Elevation (feet)
      Distance (miles)
      Speed (mph)
      Pace
      Pace (secs)
      Average Speed (mph)
      Average Pace
      Average Pace (secs)
      Ascent (feet)
      Descent (feet)
      Calories
      ride_id
    
  
  
    
      0
      2013-08-01 07:07:10
      2014-08-25 00:00:01
        1.1
      2014-08-25
       0
       41.703753
      -91.609892
       963
       0.00
       2.88
      2014-08-25 00:20:51
       1251
       0.00
      2014-08-25 00:00:00
          0
       0
         0
       0
       0
    
    
      1
      2013-08-01 07:07:17
      2014-08-25 00:00:08
        8.2
      2014-08-25
       0
       41.703825
      -91.609835
       852
       0.01
       2.88
      2014-08-25 00:20:51
       1251
       2.56
      2014-08-25 00:23:27
       1407
       0
       129
       0
       0
    
    
      2
      2013-08-01 07:07:22
      2014-08-25 00:00:13
       13.2
      2014-08-25
       0
       41.703858
      -91.609814
       789
       0.01
       2.88
      2014-08-25 00:20:51
       1251
       2.27
      2014-08-25 00:26:27
       1587
       0
       173
       0
       0
    
    
      3
      2013-08-01 07:07:27
      2014-08-25 00:00:18
       18.2
      2014-08-25
       0
       41.703943
      -91.610090
       787
       0.02
       6.60
      2014-08-25 00:09:06
        546
       4.70
      2014-08-25 00:12:47
        767
       0
       173
       1
       0
    
    
      4
      2013-08-01 07:07:40
      2014-08-25 00:00:31
       31.2
      2014-08-25
       0
       41.704381
      -91.610258
       788
       0.06
       9.50
      2014-08-25 00:06:19
        379
       6.37
      2014-08-25 00:09:26
        566
       0
       173
       2
       0

The columns should be pretty obvious. The pair (ride_id, Time) form a unique index for the DataFrame. We may set them to be the MulitIndex later. Pace is just $\frac{time}{distance}$.

Let's drop some of the redundent columns and do a bit of renaming.



In [4]:

    
df = df.drop(['Ride Time', 'Stopped Time', 'Pace', 'Average Pace'], axis=1)

def renamer(name):
    for char in ['(', ')']:
        name = name.replace(char, '')
    name = name.replace(' ', '_')
    name = name.lower()
    return name

df = df.rename(columns=renamer)

Timelines

Using a snazzy new feature coming out in pandas 0.15, we can easily access the time attributes of the the datetime columns under the .dt namespace.



In [5]:

    
sub = df[['time', 'distance_miles', 'ride_id']].copy()
sub['time'] = sub['time'].dt.time
sub = sub.sort(columns='time')

Let's break them into two parts, morning and afternoon.



In [6]:

    
morning = sub[sub.time < datetime.time(12)]
evening = sub[sub.time >= datetime.time(12)]



In [7]:

    
fig, ax = plt.subplots(figsize=(12, 5))
ax.scatter(morning.time.values, morning.distance_miles, marker='.',
           color='k', linewidths=0.01, alpha=.5)
ax.set_ylim(0, 8)









    Out[7]:





(0, 8)



In [8]:

    
fig, ax = plt.subplots(figsize=(12, 5))
ax.scatter(evening.time.values, evening.distance_miles, marker='.',
           color='k', linewidths=0.01, alpha=.5)
ax.set_ylim(0, 8)









    Out[8]:





(0, 8)

Fun. The horizontal distance is the length of time it took me to make the ride. I like this chart because it also conveys the start time of each ride. The plot shows that the morning ride typically took longer, but we can verify that.



In [9]:

    
is_morning = df.time.dt.time < datetime.time(12, 0, 0)

ride_time = df.groupby(['ride_id', is_morning])['ride_time_secs'].agg('max')
mean_time = ride_time.groupby(level=1).mean().rename(
    index={True: 'morning', False: 'evening'})
mean_time / 60









    Out[9]:





evening    30.761667
morning    29.362716
Name: ride_time_secs, dtype: float64

So the morning ride is typically shorter! But I think I know what's going on. We were misleading with our plot earlier since the range of the horizontal axis weren't identical. Always check the axis!



In [10]:

    
fig, ax = plt.subplots(figsize=(12, 5))
ax.scatter(morning.time.values, morning.distance_miles, marker='.',
           color='k', linewidths=0.01, alpha=.5)
ax.set_xlim(datetime.time(6, 40), datetime.time(10, 40))
ax.set_ylim(0, 8)









    Out[10]:





(0, 8)



In [11]:

    
fig, ax = plt.subplots(figsize=(12, 5))
ax.scatter(evening.time.values, evening.distance_miles, marker='.',
           color='k', linewidths=0.01, alpha=.5)
ax.set_xlim(datetime.time(15, 30), datetime.time(19, 30))
ax.set_ylim(0, 8)









    Out[11]:





(0, 8)



In [12]:

    
axes = df.groupby('ride_id').plot(x='distance_miles', y='elevation_feet', color='k', alpha=.5)



In [13]:

    
df['is_morning'] = df.time.dt.time < datetime.time(12)



In [14]:

    
sns.tsplot(df, time="distance_miles", unit="ride_id", value="elevation_feet")









    



---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-14-34c2210b4fa5> in <module>()
----> 1 sns.tsplot(df, time="distance_miles", unit="ride_id", value="elevation_feet")

/Users/tom/Envs/py3/lib/python3.4/site-packages/seaborn-0.4.dev-py3.4.egg/seaborn/timeseries.py in tsplot(data, time, unit, condition, value, err_style, ci, interpolate, color, estimator, n_boot, err_palette, err_kws, legend, ax, **kwargs)
    201     for c, (cond, df_c) in enumerate(data.groupby(condition, sort=False)):
    202 
--> 203         df_c = df_c.pivot(unit, time, value)
    204         x = df_c.columns.values.astype(np.float)
    205 

/Users/tom/Envs/py3/lib/python3.4/site-packages/pandas-0.14.1_283_g66630f4-py3.4-macosx-10.9-x86_64.egg/pandas/core/frame.py in pivot(self, index, columns, values)
   3273         """
   3274         from pandas.core.reshape import pivot
-> 3275         return pivot(self, index=index, columns=columns, values=values)
   3276 
   3277     def stack(self, level=-1, dropna=True):

/Users/tom/Envs/py3/lib/python3.4/site-packages/pandas-0.14.1_283_g66630f4-py3.4-macosx-10.9-x86_64.egg/pandas/core/reshape.py in pivot(self, index, columns, values)
    358                          index=MultiIndex.from_arrays([self[index],
    359                                                        self[columns]]))
--> 360         return indexed.unstack(columns)
    361 
    362 

/Users/tom/Envs/py3/lib/python3.4/site-packages/pandas-0.14.1_283_g66630f4-py3.4-macosx-10.9-x86_64.egg/pandas/core/series.py in unstack(self, level)
   1879         """
   1880         from pandas.core.reshape import unstack
-> 1881         return unstack(self, level)
   1882 
   1883     #----------------------------------------------------------------------

/Users/tom/Envs/py3/lib/python3.4/site-packages/pandas-0.14.1_283_g66630f4-py3.4-macosx-10.9-x86_64.egg/pandas/core/reshape.py in unstack(obj, level)
    432             return obj.T.stack(dropna=False)
    433     else:
--> 434         unstacker = _Unstacker(obj.values, obj.index, level=level)
    435         return unstacker.get_result()
    436 

/Users/tom/Envs/py3/lib/python3.4/site-packages/pandas-0.14.1_283_g66630f4-py3.4-macosx-10.9-x86_64.egg/pandas/core/reshape.py in __init__(self, values, index, level, value_columns)
     96 
     97         self._make_sorted_values_labels()
---> 98         self._make_selectors()
     99 
    100     def _make_sorted_values_labels(self):

/Users/tom/Envs/py3/lib/python3.4/site-packages/pandas-0.14.1_283_g66630f4-py3.4-macosx-10.9-x86_64.egg/pandas/core/reshape.py in _make_selectors(self)
    138 
    139         if mask.sum() < len(self.index):
--> 140             raise ValueError('Index contains duplicate entries, '
    141                              'cannot reshape')
    142 

ValueError: Index contains duplicate entries, cannot reshape



In [ ]:

    
df.groupby('ride_id').plot(x='distance_miles', y='elevation_feet')



In [ ]:

    
sns.tsplot(df, time="ride_time_secs", value="distance_miles", condition="is_morning", unit="ride_id")



In [15]:









    Out[15]:






  
    
      
      time
      ride_time_secs
      stopped_time_secs
      latitude
      longitude
      elevation_feet
      distance_miles
      speed_mph
      pace_secs
      average_speed_mph
      average_pace_secs
      ascent_feet
      descent_feet
      calories
      ride_id
      is_morning
    
  
  
    
      0    
      2013-08-01 07:07:10
          1.1
         0.0
       41.703753
      -91.609892
       963
       0.00
        2.88
       1251
        0.00
          0
         0
         0
         0
        0
       True
    
    
      1    
      2013-08-01 07:07:17
          8.2
         0.0
       41.703825
      -91.609835
       852
       0.01
        2.88
       1251
        2.56
       1407
         0
       129
         0
        0
       True
    
    
      2    
      2013-08-01 07:07:22
         13.2
         0.0
       41.703858
      -91.609814
       789
       0.01
        2.88
       1251
        2.27
       1587
         0
       173
         0
        0
       True
    
    
      3    
      2013-08-01 07:07:27
         18.2
         0.0
       41.703943
      -91.610090
       787
       0.02
        6.60
        546
        4.70
        767
         0
       173
         1
        0
       True
    
    
      4    
      2013-08-01 07:07:40
         31.2
         0.0
       41.704381
      -91.610258
       788
       0.06
        9.50
        379
        6.37
        566
         0
       173
         2
        0
       True
    
    
      5    
      2013-08-01 07:07:42
         33.2
         0.0
       41.704459
      -91.610286
       792
       0.06
        9.67
        372
        6.59
        547
         0
       173
         3
        0
       True
    
    
      6    
      2013-08-01 07:07:54
         45.2
         0.0
       41.704936
      -91.610316
       778
       0.09
       10.23
        352
        7.46
        482
         0
       173
         4
        0
       True
    
    
      7    
      2013-08-01 07:08:01
         52.2
         0.0
       41.705232
      -91.610316
       791
       0.11
       10.68
        337
        7.87
        457
         0
       173
         5
        0
       True
    
    
      8    
      2013-08-01 07:08:03
         54.2
         0.0
       41.705337
      -91.610248
       799
       0.12
       11.39
        316
        8.11
        444
         0
       173
         6
        0
       True
    
    
      9    
      2013-08-01 07:08:14
         65.2
         0.0
       41.705797
      -91.610270
       779
       0.15
       10.57
        341
        8.51
        423
         0
       173
         7
        0
       True
    
    
      10   
      2013-08-01 07:08:24
         75.2
         0.0
       41.706280
      -91.610187
       790
       0.19
       10.24
        352
        8.98
        401
         0
       173
        10
        0
       True
    
    
      11   
      2013-08-01 07:08:34
         85.2
         0.0
       41.706742
      -91.610131
       793
       0.22
       11.56
        311
        9.28
        388
         0
       173
        12
        0
       True
    
    
      12   
      2013-08-01 07:08:43
         94.2
         0.0
       41.707208
      -91.610148
       799
       0.25
       12.68
        284
        9.62
        374
         0
       173
        15
        0
       True
    
    
      13   
      2013-08-01 07:08:50
        101.6
         0.0
       41.707687
      -91.610108
       792
       0.28
       15.81
        228
       10.09
        357
         0
       173
        17
        0
       True
    
    
      14   
      2013-08-01 07:08:57
        108.2
         0.0
       41.708137
      -91.610112
       795
       0.32
       16.42
        219
       10.51
        343
         0
       173
        18
        0
       True
    
    
      15   
      2013-08-01 07:09:00
        111.4
         0.0
       41.708287
      -91.610094
       799
       0.33
       15.21
        237
       10.55
        341
         0
       173
        20
        0
       True
    
    
      16   
      2013-08-01 07:09:02
        113.3
         0.0
       41.708368
      -91.609948
       800
       0.34
       15.14
        238
       10.67
        337
         0
       173
        20
        0
       True
    
    
      17   
      2013-08-01 07:09:11
        122.3
         0.0
       41.708453
      -91.609323
       810
       0.37
       15.24
        236
       10.85
        332
         0
       173
        24
        0
       True
    
    
      18   
      2013-08-01 07:09:19
        130.3
         0.0
       41.708503
      -91.608707
       816
       0.40
       12.86
        280
       11.07
        325
         0
       173
        28
        0
       True
    
    
      19   
      2013-08-01 07:09:23
        134.3
         0.0
       41.708511
      -91.608398
       819
       0.42
       12.05
        299
       11.17
        322
         0
       173
        29
        0
       True
    
    
      20   
      2013-08-01 07:09:43
        154.3
         0.0
       41.708425
      -91.607769
       834
       0.45
       10.29
        350
       10.49
        343
        41
       173
        36
        0
       True
    
    
      21   
      2013-08-01 07:09:51
        162.3
         0.0
       41.708439
      -91.607102
       845
       0.48
       14.85
        242
       10.74
        335
        41
       173
        40
        0
       True
    
    
      22   
      2013-08-01 07:09:59
        170.9
         0.0
       41.708440
      -91.606433
       839
       0.52
       15.58
        231
       10.93
        329
        41
       173
        40
        0
       True
    
    
      23   
      2013-08-01 07:10:06
        177.4
         0.0
       41.708439
      -91.605821
       820
       0.55
       16.04
        224
       11.17
        322
        41
       173
        41
        0
       True
    
    
      24   
      2013-08-01 07:10:13
        184.7
         0.0
       41.708440
      -91.605220
       816
       0.58
       16.04
        224
       11.33
        318
        41
       173
        41
        0
       True
    
    
      25   
      2013-08-01 07:10:21
        192.3
         0.0
       41.708515
      -91.604569
       809
       0.62
       15.34
        235
       11.52
        312
        41
       184
        42
        0
       True
    
    
      26   
      2013-08-01 07:10:28
        199.2
         0.0
       41.708662
      -91.603935
       800
       0.65
       17.13
        210
       11.74
        307
        41
       200
        42
        0
       True
    
    
      27   
      2013-08-01 07:10:35
        206.2
         0.0
       41.708785
      -91.603353
       796
       0.68
       16.71
        215
       11.89
        303
        41
       209
        43
        0
       True
    
    
      28   
      2013-08-01 07:10:41
        212.3
         0.0
       41.708857
      -91.602961
       799
       0.70
       13.62
        264
       11.91
        302
        41
       209
        43
        0
       True
    
    
      29   
      2013-08-01 07:10:51
        222.3
         0.0
       41.708393
      -91.602801
       806
       0.74
       12.03
        299
       11.90
        302
        41
       209
        47
        0
       True
    
    
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
    
    
      15961
      2013-10-10 07:29:02
       1463.3
       185.0
       41.661385
      -91.542541
       755
       6.47
       10.47
        344
       15.92
        226
       203
       328
       306
       44
       True
    
    
      15962
      2013-10-10 07:29:03
       1464.2
       185.0
       41.661333
      -91.542399
       748
       6.48
       13.38
        269
       15.93
        226
       203
       328
       306
       44
       True
    
    
      15963
      2013-10-10 07:29:05
       1466.2
       185.0
       41.661334
      -91.542143
       735
       6.49
       17.82
        202
       15.94
        226
       203
       328
       306
       44
       True
    
    
      15964
      2013-10-10 07:29:06
       1467.3
       185.0
       41.661261
      -91.542036
       729
       6.50
       19.88
        181
       15.95
        226
       203
       339
       306
       44
       True
    
    
      15965
      2013-10-10 07:29:11
       1472.3
       185.0
       41.661266
      -91.541427
       703
       6.53
       23.86
        151
       15.97
        225
       203
       377
       306
       44
       True
    
    
      15966
      2013-10-10 07:29:16
       1477.2
       185.0
       41.661271
      -91.540820
       692
       6.56
       23.10
        156
       16.00
        225
       203
       377
       306
       44
       True
    
    
      15967
      2013-10-10 07:29:21
       1482.4
       185.0
       41.661274
      -91.540171
       682
       6.60
       23.27
        155
       16.02
        225
       203
       377
       307
       44
       True
    
    
      15968
      2013-10-10 07:29:26
       1487.3
       185.0
       41.661277
      -91.539543
       674
       6.63
       23.72
        152
       16.05
        224
       203
       405
       307
       44
       True
    
    
      15969
      2013-10-10 07:29:32
       1493.5
       185.0
       41.661281
      -91.538852
       677
       6.67
       22.03
        163
       16.07
        224
       203
       410
       309
       44
       True
    
    
      15970
      2013-10-10 07:29:41
       1502.2
       185.0
       41.661285
      -91.538190
       682
       6.70
       12.29
        293
       16.06
        224
       203
       410
       312
       44
       True
    
    
      15971
      2013-10-10 07:29:51
       1512.2
       185.0
       41.661295
      -91.537755
       679
       6.72
        8.48
        424
       16.01
        225
       203
       410
       313
       44
       True
    
    
      15972
      2013-10-10 07:29:54
       1515.3
       185.0
       41.661505
      -91.537672
       678
       6.74
       12.59
        286
       16.01
        225
       203
       410
       313
       44
       True
    
    
      15973
      2013-10-10 07:30:01
       1522.3
       185.0
       41.661987
      -91.537653
       677
       6.77
       17.34
        208
       16.01
        225
       203
       410
       314
       44
       True
    
    
      15974
      2013-10-10 07:30:02
       1523.3
       185.0
       41.662019
      -91.537652
       676
       6.77
       15.92
        226
       16.01
        225
       203
       410
       314
       44
       True
    
    
      15975
      2013-10-10 07:30:03
       1524.3
       185.0
       41.662041
      -91.537593
       676
       6.78
       14.54
        248
       16.01
        225
       203
       410
       314
       44
       True
    
    
      15976
      2013-10-10 07:30:04
       1525.2
       185.0
       41.662037
      -91.537666
       676
       6.78
       14.54
        248
       16.00
        225
       203
       410
       314
       44
       True
    
    
      15977
      2013-10-10 07:30:05
       1526.2
       185.0
       41.662106
      -91.537775
       676
       6.79
       16.14
        223
       16.01
        225
       203
       410
       314
       44
       True
    
    
      15978
      2013-10-10 07:30:06
       1527.2
       185.0
       41.662124
      -91.537731
       677
       6.79
       14.84
        243
       16.01
        225
       203
       410
       314
       44
       True
    
    
      15979
      2013-10-10 07:30:07
       1528.2
       185.0
       41.662192
      -91.537682
       678
       6.80
       14.90
        242
       16.01
        225
       203
       410
       314
       44
       True
    
    
      15980
      2013-10-10 07:30:10
       1531.4
       185.0
       41.662320
      -91.537746
       682
       6.81
       13.33
        270
       16.00
        225
       203
       410
       315
       44
       True
    
    
      15981
      2013-10-10 07:30:13
       1534.4
       185.0
       41.662452
      -91.537677
       684
       6.82
       13.77
        261
       15.99
        225
       203
       410
       316
       44
       True
    
    
      15982
      2013-10-10 07:30:15
       1536.2
       185.0
       41.662497
      -91.537527
       685
       6.82
       14.14
        255
       15.99
        225
       203
       410
       316
       44
       True
    
    
      15983
      2013-10-10 07:30:31
       1552.3
       185.0
       41.662616
      -91.536922
       692
       6.86
        6.70
        537
       15.90
        226
       203
       410
       319
       44
       True
    
    
      15984
      2013-10-10 07:30:53
       1574.3
       185.0
       41.662583
      -91.536315
       707
       6.89
        5.50
        654
       15.75
        229
       231
       410
       321
       44
       True
    
    
      15985
      2013-10-10 07:30:57
       1578.2
       185.0
       41.662569
      -91.536220
       710
       6.89
        5.00
        721
       15.72
        229
       240
       410
       321
       44
       True
    
    
      15986
      2013-10-10 07:31:00
       1581.3
       185.0
       41.662641
      -91.535916
       712
       6.91
        9.28
        388
       15.73
        229
       240
       410
       322
       44
       True
    
    
      15987
      2013-10-10 07:31:15
       1588.2
       193.1
       41.662581
      -91.535310
       722
       6.94
        7.01
        514
       15.73
        229
       240
       410
       323
       44
       True
    
    
      15988
      2013-10-10 07:31:29
       1588.2
       207.1
       41.662550
      -91.535287
       722
       6.94
        0.98
       3677
       15.74
        229
       240
       410
       323
       44
       True
    
    
      15989
      2013-10-10 07:31:50
       1588.2
       227.6
       41.662586
      -91.535258
       722
       6.95
        0.86
       4174
       15.74
        229
       240
       410
       323
       44
       True
    
    
      15990
      2013-10-10 07:31:52
       1588.2
       230.0
       41.662586
      -91.535258
       722
       6.95
        0.51
       7074
       15.74
        229
       240
       410
       323
       44
       True
    
  

15991 rows × 16 columns



In [ ]:



In [ ]:



In [ ]:

	Time	Ride Time	Ride Time (secs)	Stopped Time	Latitude	Longitude	Elevation (feet)	Distance (miles)	Speed (mph)	Pace	Pace (secs)	Average Speed (mph)	Average Pace	Average Pace (secs)	Descent (feet)	Calories
0	2013-08-01 07:07:10	2014-08-25 00:00:01	1.1	2014-08-25	41.703753	-91.609892	963	0.00	2.88	2014-08-25 00:20:51	1251	0.00	2014-08-25 00:00:00	0	0	0
1	2013-08-01 07:07:17	2014-08-25 00:00:08	8.2	2014-08-25	41.703825	-91.609835	852	0.01	2.88	2014-08-25 00:20:51	1251	2.56	2014-08-25 00:23:27	1407	129	0
2	2013-08-01 07:07:22	2014-08-25 00:00:13	13.2	2014-08-25	41.703858	-91.609814	789	0.01	2.88	2014-08-25 00:20:51	1251	2.27	2014-08-25 00:26:27	1587	173	0
3	2013-08-01 07:07:27	2014-08-25 00:00:18	18.2	2014-08-25	41.703943	-91.610090	787	0.02	6.60	2014-08-25 00:09:06	546	4.70	2014-08-25 00:12:47	767	173	1
4	2013-08-01 07:07:40	2014-08-25 00:00:31	31.2	2014-08-25	41.704381	-91.610258	788	0.06	9.50	2014-08-25 00:06:19	379	6.37	2014-08-25 00:09:26	566	173	2